Contrastive Feature Masking Open-Vocabulary Vision Transformer
We present Contrastive Feature Masking Vision Transformer (CFM-ViT), an
image-text pretraining methodology that jointly learns image- and region-level
representations for open-vocabulary object detection (OVD). Our approach
incorporates the masked autoencoder (MAE) objective into the contrastive
learning objective to improve the representation for localization tasks. Unlike
the classical MAE, which reconstructs in pixel space, we perform the
reconstruction in the joint image-text embedding space, which encourages the
model to learn region-level semantics.
Moreover, we introduce Positional Embedding Dropout (PED) to address scale
variation between image-text pretraining and detection finetuning by randomly
dropping out the positional embeddings during pretraining. PED improves
detection performance and enables the use of a frozen ViT backbone as a region
classifier, preventing the forgetting of open-vocabulary knowledge during
detection finetuning. On the LVIS open-vocabulary detection benchmark, CFM-ViT
achieves a state-of-the-art 33.9 AP, surpassing the best approach by 7.6
points, and achieves better zero-shot detection transfer. Finally, CFM-ViT
acquires a strong image-level representation, outperforming the state of the art
on 8 out of 12 metrics on zero-shot image-text retrieval benchmarks.
Comment: Accepted to ICCV 202
Self-Supervised Video Representation Learning with Space-Time Cubic Puzzles
Self-supervised tasks such as colorization, inpainting, and jigsaw puzzles have
been used for visual representation learning on still images when labeled
images are scarce or entirely absent. Recently, this line of work has extended
to the video domain, where the cost of human labeling is even higher. However,
most existing methods are still based on 2D CNN architectures that cannot
directly capture spatio-temporal information for video applications. In this
paper, we introduce a new self-supervised task called \textit{Space-Time Cubic
Puzzles} to train 3D CNNs on a large-scale video dataset. This task requires a
network to arrange permuted 3D spatio-temporal crops. By completing
\textit{Space-Time Cubic Puzzles}, the network learns both the spatial
appearance and the temporal relations of video frames, which is our ultimate
goal. In experiments, we demonstrate that our learned 3D representation
transfers well to action recognition tasks and outperforms state-of-the-art
2D CNN-based competitors on the UCF101 and HMDB51 datasets.
Comment: Accepted to AAAI 201
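The abstract describes the task only at a high level; the Python sketch below shows one plausible way a puzzle sample could be constructed, assuming a 2x2 spatial crop layout and a permutation-classification target, both of which are illustrative assumptions rather than the paper's exact design:

    import itertools
    import random
    import torch

    def make_cubic_puzzle(clip: torch.Tensor, num_crops: int = 4):
        # Hedged sketch: cut a clip of shape (C, T, H, W) into spatio-temporal
        # crops, shuffle them, and use the permutation index as the label the
        # 3D CNN must predict.
        C, T, H, W = clip.shape
        h, w = H // 2, W // 2
        crops = [clip[:, :, :h, :w],      clip[:, :, :h, w:2 * w],
                 clip[:, :, h:2 * h, :w], clip[:, :, h:2 * h, w:2 * w]]
        perms = list(itertools.permutations(range(num_crops)))    # 4! = 24 classes
        label = random.randrange(len(perms))
        shuffled = torch.stack([crops[i] for i in perms[label]])  # (4, C, T, h, w)
        return shuffled, torch.tensor(label)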
Neural Image-based Avatars: Generalizable Radiance Fields for Human Avatar Modeling
We present a method that enables synthesizing novel views and novel poses of
arbitrary human performers from sparse multi-view images. A key ingredient of
our method is a hybrid appearance blending module that combines the advantages
of the implicit body NeRF representation and image-based rendering. Existing
generalizable human NeRF methods that are conditioned on the body model have
shown robustness against the geometric variation of arbitrary human performers.
Yet they often produce blurry results when generalized to unseen identities.
Meanwhile, image-based rendering shows high-quality results when sufficient
observations are available, but it suffers from artifacts in sparse-view
settings. We propose Neural Image-based Avatars (NIA), which exploits the best
of both approaches: it maintains robustness under new articulations and
self-occlusions while directly leveraging the available (sparse) source-view
colors to preserve the appearance details of new subject identities. Our hybrid
design outperforms recent methods in both in-domain identity generalization and
challenging cross-dataset generalization settings. In terms of pose
generalization, our method even outperforms per-subject optimized animatable
NeRF methods. The video results are available at
https://youngjoongunc.github.io/ni
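The blending module itself is not specified in the abstract; one minimal sketch of a hybrid appearance blend is shown below, where the softmax weighting over the NeRF color and the projected source-view colors is an assumption for illustration, not the paper's actual formulation:

    import torch

    def blend_appearance(nerf_rgb: torch.Tensor,
                         source_rgbs: torch.Tensor,
                         blend_logits: torch.Tensor) -> torch.Tensor:
        # Hedged sketch: mix the color predicted by an implicit body NeRF with
        # colors projected from V sparse source views, using learned weights.
        #   nerf_rgb:     (N, 3)     implicit body-NeRF color per sample
        #   source_rgbs:  (N, V, 3)  colors sampled from the source views
        #   blend_logits: (N, V + 1) logits for [NeRF, view_1, ..., view_V]
        candidates = torch.cat([nerf_rgb.unsqueeze(1), source_rgbs], dim=1)  # (N, V+1, 3)
        weights = torch.softmax(blend_logits, dim=-1).unsqueeze(-1)          # (N, V+1, 1)
        return (weights * candidates).sum(dim=1)                             # (N, 3)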
RECLIP: Resource-efficient CLIP by Training with Small Images
We present RECLIP (Resource-efficient CLIP), a simple method that minimizes
computational resource footprint for CLIP (Contrastive Language Image
Pretraining). Inspired by the notion of coarse-to-fine processing in computer
vision, we leverage small images to learn from large-scale language supervision
efficiently, and finally finetune the model with high-resolution data. Since
the complexity of the vision transformer heavily depends on input image size,
our approach significantly reduces the training resource requirements both in
theory and in practice. With the same batch size and number of training epochs,
RECLIP achieves highly competitive zero-shot classification and image-text
retrieval accuracy with 6 to 8x fewer computational resources and 7 to 9x fewer
FLOPs than the baseline. Compared to state-of-the-art contrastive learning
methods, RECLIP demonstrates 5 to 59x training resource savings while
maintaining highly competitive zero-shot classification and retrieval
performance. Finally, RECLIP matches the state of the art in transfer learning
to open-vocabulary detection tasks, achieving 32 APr on LVIS. We hope this work
will pave the way for the broader research community to explore
language-supervised pretraining in resource-friendly settings.
Comment: Published at Transactions on Machine Learning Research
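The saving follows from the vision transformer's cost growing rapidly with the number of input patches; a hedged sketch of one small-image contrastive training step is shown below, where the resize size, temperature, and symmetric InfoNCE loss are illustrative assumptions rather than RECLIP's exact recipe:

    import torch
    import torch.nn.functional as F

    def contrastive_step(image_encoder, text_encoder, images, texts,
                         image_size: int = 64, temperature: float = 0.07):
        # Hedged sketch of a CLIP-style step at reduced image resolution.
        # A small image_size would be used for main training and a
        # full-resolution size for the final finetuning phase, mirroring the
        # coarse-to-fine schedule described in the abstract.
        images = F.interpolate(images, size=(image_size, image_size),
                               mode='bilinear', align_corners=False)
        img_feat = F.normalize(image_encoder(images), dim=-1)   # (B, D)
        txt_feat = F.normalize(text_encoder(texts), dim=-1)     # (B, D)
        logits = img_feat @ txt_feat.t() / temperature          # (B, B) similarities
        labels = torch.arange(logits.size(0), device=logits.device)
        # symmetric InfoNCE over image-to-text and text-to-image directions
        return (F.cross_entropy(logits, labels) +
                F.cross_entropy(logits.t(), labels)) / 2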